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Abstract 

Background: Current sequencing technology makes it practical to sequence many samples of a given organism, 
raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. 
Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and 
using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism 
mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the 
ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be 
associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping 
is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, 
genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic 
genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. 
Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single- 
processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation 
DNA sequencing machines, because they increase in computational complexity quickly with the number of 
sequences. 

Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each 
sequence in the dataset as a vector derived from its constituent k-mers by passing the need for multiple sequence 
alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance 
measure over the vectors. Our methods produce results that correlate well with expert-defined clades or 
genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. 
Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple 
compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 
hours. 

Conclusions: We conclude that using Nephele can substantially decrease the processing time required for 
generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage. 



Background 

In the post-genomic era, as sequencing becomes ever 
cheaper and more routine, biological sequence analysis 
has provided many useful tools for the study and com- 
bat of infectious disease. These tools, which can include 
both experimental and computational methods, are 
important for molecular epidemiological studies [1-3], 
vaccine development [4-6], and microbial forensics 
[7-9]. One such method is genotyping, the grouping of 
samples based on their genetic sequence. This can be 
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done experimentally [10-12] or computationally, either 
by identifying genetic signatures (nucleotide substrings 
which are only found in a single group of sequences) 
[13], or on the basis of genetic distance among the 
sequences [14-16]. These methods allow a researcher to 
split a group of sequences into distinct partitions for 
further analysis. In a forensics context, genotyping a 
sequence can yield clues on where the sequence comes 
from. In surveillance, genotyping can be used to exam- 
ine the evolutionary footprint of a pathogen, for exam- 
ple, to identify areas where certain vaccines and other 
countermeasures should be used. 
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Sequence-based comparison involves three major 
steps. The first is to choose a set of sequences to study, 
based on some criteria, such as strain, time period or 
geographic region. Ideally, this set can be easily 
extracted from a well-populated reference database, con- 
taining not only the sequence data for the samples of 
interest, such as a particular serotype of Influenza, but 
also sufficient metadata. For infectious diseases, types of 
metadata include geospatial and temporal co-ordinates, 
host information, and pathogenicity. Once the appropri- 
ate dataset is chosen, the samples are compared and 
clustered in sequence space. From here, the metadata 
associated with the sequences is used to assess the evo- 
lutionary landscape of the organism or pathogen [17]. 

Sequence Comparison Methods 

Traditionally, the first step in performing sequence com- 
parisons is to generate a multiple sequence alignment 
(MSA) from the sequences of interest. This is most 
often done using heuristics found in utilities such as 
CLUSTAL W [18], MUSCLE [19], T-COFFEE [20], and 
ProbCons [21]. The dynamic programming solution, 
which can find the mathematically but not necessarily 
biologically correct solution, quickly becomes impracti- 
cal with the sample sizes used in any meaningful analy- 
sis. A recent review [22] examined many of the issues in 
producing these alignments, most notably the trade-offs 
between alignment accuracy, time, and computational 
expense. Many of the most accurate algorithms cannot 
be used on a large number of sequences, or on very 
lengthy sequences, and were only recommended for sets 
of less than 100 sequences. 

Because the alignment is dependent on each of the 
sequences from which it is calculated, the alignment 
must be recomputed whenever a new sequence is 
added. This becomes problematic for surveillance appli- 
cations, where new sequences will be added constantly. 
While this problem has been mitigated to some extent 
using with algorithms such as Near-Alignment Space 
Termination (NAST) [23], this still adds a level of com- 
plexity if the dataset is continually growing, as is the 
case with Influenza and other infectious diseases. 
Another issue is that different heuristics will yield differ- 
ent alignments - they are only designed to find an 
acceptable answer, not the optimal alignment. While 
methods have been developed to find a "consensus 
alignment" [24] from a set of alignments, this requires a 
good deal of time and computing power. 

The composition vector (CV) [25] method has been 
used to describe DNA/RNA and protein sequences as 
vectors, using the distance between these vectors as the 
genetic distance. This method involves using a sliding 
window to represent each sequence as a vector, where 
each element of the vector is calculated based on the 



actual and expected frequency of the k-mer (DNA/pro- 
tein subsequence of length k) observed in that window. 
The vector representation allows the distance between 
two sequences to be calculated with any standard dis- 
tance metric. The CV method was shown to produce 
trees which matched established taxonomies, as inferred 
from the 16S RNA segment by more conventional align- 
ment-based methods [26]. The CV method was later 
expanded into the complete composition vector (CCV) 
method [27], which uses sliding windows over a range 
of lengths to describe the sequence. Since these methods 
do not require alignments to be calculated, distances 
calculated between sequences remain constant, rather 
than being dependent on the set of sequences being 
examined, making these methods ideal for the handling 
of rapidly growing datasets. No molecular models need 
be used to calculate distances - distances are calculated 
using any distance metric that can be used to calculate 
the distance between vectors. 

The next step in sequence analysis is the clustering of 
the sequences. Traditionally, this is done by inferring a 
phylogenetic tree. Tools for this purpose include PHY- 
LIP [28], PAUP* [29], or POY [30]. This work was initi- 
ally performed using distance-based methods, such as 
the UPGMA or neighbour-joining algorithms [31], or 
cladistic methods such as Maximum Parsimony. As 
computational power increased, methods that inferred 
trees based on models of evolution were used. These 
include the Maximum Likelihood technique, as well as 
Bayesian Inference. While these methods produce phylo- 
genetic trees, which provide a useful visualization, any 
further analysis and grouping must be performed manu- 
ally. As the number of sequences to compare increases, 
this becomes more and more difficult. In fact, there has 
been much recent research into new methods to visua- 
lize phylogenetic trees with large numbers of leaves 
[32,33]. In addition, the phylogenetic tree view proves 
difficult to integrate with the metadata. For example, a 
recent paper discussing the spread of H5N1 Avian Influ- 
enza used Google Earth to draw a phylogenetic tree on 
top of the globe [34]. While this visualization works well 
for a small number of samples, it is ineffective for larger 
datasets, due to the "busyness" of the visualization. 

Computational Genotyping 

An alternative to the pure phylogenetic approach is 
computational genotyping. This involves partitioning the 
set of sequences into discrete groups, based on some 
criteria. This can be based on differences between 
known subtypes, such as tandem repeats or single 
nucleotide polymorphisms, or by genetic distance. In the 
case of genotyping based on distance, this becomes a 
clustering problem. In 2007, Frey and Dueck published 
a paper on a new clustering algorithm known as affinity 
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propagation clustering [35]. In contrast to other cluster- 
ing algorithms, such as k-means and Expectation Maxi- 
mization (EM), the affinity propagation algorithm does 
not require the user to explicitly select a given number 
of exemplars at the start of clustering. Instead, affinity 
propagation simultaneously considers all points as 
potential exemplars, using an initial preference to deter- 
mine the sensitivity, and therefore the number of clus- 
ters. This eliminates the need for large numbers of runs 
to determine the ideal number of clusters and any 
dependence on initial conditions seen in other partition 
clustering algorithms. Furthermore, this algorithm 
allows the user to set the preference for each data point. 
This is useful for a scenario where a partial set of repre- 
sentative samples are known, but there may be other 
exemplars along with these in a data set. The affinity 
propagation has been tested on geospatial, text, and 
gene expression data and showed improvements in both 
speed and accuracy over other clustering algorithms. 

The main advantage of an automated computational 
genotyping method is that it gives the researcher the 
ability to combine a measure of sequence similarity 
(cluster membership) with the metadata. It is this meta- 
data that yields the most information about a sample. A 
phylogenetic tree will tell what samples are close in 
sequence space, but any further inference is made using 
the metadata. By separating the sequences into discrete 
groups, the researcher is given much more flexibility to 
visualize the data and associated metadata. 

MapReduce 

MapReduce [36] is the software framework developed by 
Google™ to support parallel distributed execution of 
their data intensive applications. MapReduce is designed 
for fault-tolerant computations with extremely large 
datasets. MapReduce is divided into two major phases 
called map and reduce, separated by an internal shuffle 
phase of the intermediate results. Hadoop is an open- 
source version of MapReduce implemented in Java and 
sponsored by Amazon™, Yahoo™, and other major 
vendors. Recently, MapReduce has been used for 
sequence and phylogenetic applications. For example, 
CloudBurst uses Hadoop for parallel short read-mapping 
for use in a variety of biological analyses including SNP 
discovery, genotyping, and personal genomics [37]. The 
Genome Analysis Toolkit uses the MapReduce paradigm 
for shared memory platforms [38]. MrsRF (MapReduce 
Speeds up RF) is a multi-core, multi-machine algorithm 
that generates t x t Robinson-Foulds distance matrix 
between t trees [39] using Phoenix [40], a MapReduce 
implementation for shared memory multi-core platform, 
and OpenMPI [41]. These uses indicate that MapReduce 
is a promising tool to help solve the computational chal- 
lenges with large datasets. 



Nephele 

In this paper, we describe a scalable complete genotyping 
system that brings together the complete composition 
vector and affinity propagation algorithms to produce 
genotypes from Influenza A sequences. The system has 
been tested on a variety of Influenza A and Actinomycetes 
genome data. In addition to providing discrete clusters 
representing genotypes, we use methods that produce 
trees that closely match the topologies of trees inferred 
using traditional phylogenetic methods, in order to pro- 
vide scientists with a more familiar visualization. 

Implementation 

Datasets 

The Influenza dataset used to develop our methods was 
that of Holmes et al. [42]. The clades and reassortment 
events found in these samples were discussed in detail, pro- 
viding eight sets of sequences (one for each gene studied in 
the paper) for verification of our methods. This dataset 
consists of 155 samples, taken from New York State during 
the 1999-2000, 2001-2002, 2002-2003, and 2003-2004 flu 
seasons. The complete coding sequences are available, as 
well as the date and county of collection for these strains. 
We also used HA segments from H1N1 (1141) and H3N2 
(2201) parsed from GenBank's viral division (gbvrl). For 
testing our implementation, we used 10,270 16S samples 
from GreenGenes (core_set_aligned.fasta retired on 07 Feb- 
ruary 2007; http://greengenes.lbl.gov/). 

To test our methods, two additional datasets were 
identified. A set of 94 sequences representing WHO 
expert-defined genotypes (http://www.who.int/csr/dis- 
ease/avian_influenza/guidelines/nomenclature/) was 
used to validate our methods, and another dataset repre- 
senting an 2007 Influenza outbreak in Europe was cho- 
sen to demonstrate the utility of the computational 
genotyping approach for microbial forensic analysis 
(Additional File 1). 

We also used 27 full length genomes of Actinomy- 
cetes bacteria from the Broad Institute along with their 
computed concatenated protein sequences, downloaded 
from the Tuberculosis Database (TBDB) [43]. 

Complete Composition Vector 

The method used is based on that of Wu et al. [44]. 
Each sequence, S, of a given length L, can be broken 
into L - k + 1 overlapping substrings of length k. For 
each substring a, the probability of occurrence is calcu- 
lated as 



where f(a) is the frequency of substring a in S. Next, 
the expected probability, q is calculated using a Markov 
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model described by Brendel, Beckmann, and Trifonov, 
which takes into account the probabilities of length- (k- 
1) and length(k-2) strings [45]. 



p{aia 2 ■ ■ ■ afe-i)p(a 2 ^3 ■ ■ ■ ofr) 
p{a 2 a 3 . . .a k -i) 



This is designed to highlight the role of selective 
mutation, and it was found that phylogenetic trees pro- 
duced without subtracting the background via the Mar- 
kov model were not consistent with traditional 
approaches [25]. 

The composition value, tt, for substring a is defined 
as: 



7i (a) 



0 cf = 0. 



The k th composition vector, V k (S), is comprised of the 
composition values for all possible substrings of length 
k. For amino acid sequences, V is of length 20 k, and 
for DNA/RNA, V is of length 4 k. This method has 
been shown to produce trees which match known taxo- 
nomies [26]. 

In 2004, Wu et al. extended the CV approach into the 
complete composition vector (CCV) [27]. This method 
combines the composition vector approach with the 
idea of the complete information set, in order to supple- 
ment any information loss from the background sub- 
traction in the CV method [46]. The CCV is defined as 
the sequence of composition vectors from 3 to M, 
where M is a pre-determined constant. 

For all experiments described in this paper, the com- 
plete composition vectors were calculated with M = 9. 
In addition, the revised relative entropy string selection 
string scoring scheme described by Wu et al. [44] was 
employed to reduce the dimensionality of the vectors. 
This is calculated as 



En . . 
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where II represents the complete composition vector 
calculated from the concatenation of all n sequences in 
the dataset. In summary, this method evaluates the infor- 
mation content associated with each possible substring, 
and the most informative substrings are chosen for inclu- 
sion in the analysis. The number of n-mers used for dis- 
tance calculations was chosen based on the dataset: if the 
absolute revised relative entropy was below 1.0, the sub- 
string was not used for any further calculations. 

Once the final set of n-mers is chosen, the vectors are 
normalized by calculating the Z-score for each n-mer. 
From these normalized vectors, the distance matrix is 



then calculated. For each pair of samples, the distance 
between the normalized complete composition vectors 
Vi and Vj is calculated using cosine distance: 



Wi\ Vj 



+ 1 



Dij 



We also experimented with using the Euclidian dis- 
tance, calculated as 



where n is the number of substrings kept after the 
substring selection (Figure 1). The CCV and distance 
calculation code was written in Java (1.5+), using cus- 
tom classes to save space and memory. Experiments 
were run on Apple dual quad-core Intel Mac Pro with 8 
GB running OS x 10.6. Additional testing was done 
under CentOS 5.5 and Ubuntu 9.4 Linux distributions. 

Affinity Propagation Clustering 

The input to the affinity propagation clustering algo- 
rithm is a similarity matrix. For Euclidian and Manhat- 
tan distances, the similarity is represented by the 
negative of the distance, while for cosine distances, the 
similarity is found by subtracting the distance matrix 
from 1. To determine the optimal preference, the 
mean silhouette value was used. This value is a mea- 
sure of how similar a given sample is to others in the 
same clusters, versus samples found in other clusters. 
It ranges from 1 (sample is well-clustered) to -1 (the 
sample is found in an incorrect cluster) and is calcu- 
lated as: 



5(Z) = 



h - Oj 



Where ai is the sample's average distance to the other 
samples in its cluster and bi is the minimum average 
distance between the sample and the samples in each of 
the other clusters. The developers of the algorithm 
recommend using the minimum similarity between sam- 
ples for a low number of clusters, and the median simi- 
larity for a moderate number of clusters. The preference 
resulting in the optimal partitioning, using the average 
silhouette value as a measurement was chosen from a 
set of four preferences spanning the minimum and med- 
ian similarity. Affinity propagation was performed using 
the MATLAB function available at the authors' website 
(http://www.psi.toronto.edu/affinitypropagation/), with 
the default parameters or with our re-implementation 
written in Java as part of Nephele. 
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Parallelization with Hadoop 

We also implemented most of our CCV code for 
execution as a series of nine MapReduce jobs using 
the open-source MapReduce implementation Hadoop 
(http://hadoop.apache.org/). MapReduce is not ideal 
for the generation of neighbour-joined trees or affinity 
propagation. The MapReduce paradigm depends on 
the maps not dependent on any other data than what 
they are given. Both the neighbour-joined trees and 
affinity propagation algorithms depend on shared 
states, which breaks the MapReduce paradigm. How- 
ever, Nephele provides a Message Passing Interface 
(MPI) version of Panjo, a neighbour-joining algorithm, 
that is able to handle very large trees [47]. Our version 
accepts row packed matrices instead of column packed, 
because it was easier to generate them as opposed to 
column packed matrices using MapReduce. All experi- 
ments were run on a Rocks (http://www.rocksclusters. 
org/) cluster running CentOS 5.4 on Intel Core 2 
Quad and Core 2 Duo processors with 8 and 4 GB of 
memory, respectively, using Java 1.5 and Hadoop 
0.20.1. 



Results and Discussion 

Genotyping the New York Dataset: Results, Computation 
Time and Choice of Distance Metric 

The dataset used by Holmes and colleagues to study 
reassortment events throughout New York State pro- 
vided a dataset to use to build and refine the genotyping 
methods. This set included 155 full genomes of H3N2 
found in New York state between 1999 and 2004, col- 
lected as part of the Influenza Genome Sequencing 
Initiative [48], a worldwide sequencing initiative (this 
project has also sampled from the southern hemisphere, 
in Australia and New Zealand). Since the complete gen- 
ome for each of the samples in this set was sequenced, 
this provided genes, with differing rates of evolution to 
test the pipeline. 

Clustering was performed on the eight genes studied 
in detail in the paper (HA, Ml, NA, NP, NS1, PA, PB1, 
PB2) as described in the Implementation section. Clus- 
ter counts ranged from 5 (NP, PB2) to 15 (PB1), with 
clusters ranging in size from 1 (representing an outlier 
in the dataset) to 36. The results from the affinity pro- 
pagation clustering matched the clade structure of the 
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trees. The trees for all eight genes, colored by cluster 
membership, can be seen in Additional File 2. All phylo- 
genetic trees in this work were produced using Tree- 
ViewJ [49]. The segmented nature of the Influenza 
genome adds a level of complexity to the genotyping 
problem. For each sample, the set of clusters for the 
eight genes can be used to define a cluster "profile," 
which represents the genotype defined by the complete 
genome for that sample. This profile represents the 
composite genotype for that sample. For the 155 sam- 
ples in the dataset, 32 genotypes were identified. From 
these, the groups of samples identified by Holmes and 
colleagues to be involved in reassortment events were 
found as distinct genotypes. 

One of the major advantages of the complete compo- 
sition vector approach over traditional phylogenetic 
methods is the speed of analysis. For the individual gene 
segments in the test dataset (155 sequences), execution 
times ranged from 1.50 minutes (Ml) to 2.25 minutes 
(NA) In contrast, alignment times using MUSCLE were 
on the order of 5-10 minutes, and inference of maxi- 
mum likelihood trees took roughly 20 minutes per gene 
If the genes are concatenated together to create a full 
genome sequence, the gains are even more impressive - 
trees were produced in a few minutes with the CCV- 
based approach, rather than hours for traditional align- 
ment-based methods. This is consistent with initial 
results from composition vector based approaches, 
which focused on inferring trees for complete prokaryo- 
tic genomes [26]. 

In order to determine the ideal distance metric to use 
for clustering, the patristic distances (distances between 
leaves along the branches) of the phylogenetic trees 
inferred using the neighbour-joining algorithm on dis- 
tances calculated using cosine and Euclidian distances 
were compared (see Implementation section for details 
of computation). Patristic distances (the distance 
between two leaves along the branches of a tree) were 
calculated using the TreeDistanceMatrix methods from 
the Phylogenetic Analysis Java Library [50]. Patristic dis- 
tances for each gene and distance measure were plotted 
against each other. Figure 1 shows the patristic distance 
(distance between leaves along a tree) plots for HA and 
Ml, which represent rapidly mutating and slowly mutat- 
ing genes, respectively. It is clear that trees produced 
using the neighbour-joining algorithm on cosine dis- 
tance matrices produce trees that are the most similar 
to the Maximum Likelihood trees, while the Euclidian 
distance metric produces trees which have overly large 
distances near the leaves of the tree, as shown in Figure 
2. The cosine distance is shown to produce trees whose 
patristic distance has a linear relationship with that of 
the tree produced by maximum likelihood, while the 
trees produced using Euclidean distance show a higher- 



order relationship. These indicate that while the Eucli- 
dian distance has been used as a distance metric for the 
majority of previously published work involving compo- 
sition/complete composition vectors [15,44], it appears 
that cosine distance provides a better correlation with 
trees produced by traditional phylogenetic methods. 

Clustering on H5N1 Standard Nomenclature Dataset for 
Validation 

In 2001, the World Health Organization (WHO), along 
with the World Organization for Animal Health (OIE) 
and Food and Agriculture Organization of the United 
Nations (FAO) released, in poster form, a standard 
nomenclature system for the various lineages of Influ- 
enza H5N1 found in over 50 countries throughout the 
world This nomenclature is intended to replace the cur- 
rent nomenclature used in publications, where samples 
are often identified by the location of the earliest sample 
with the closest genetic similarity (for example, "Fujian- 
like" or "Quinghai lineage"). Alignments of 904 HA 
sequences were created, and clades were chosen from 
the tree based on a set of rules. These clades, developed 
to define a new standard nomenclature, provided an 
opportunity to blind test set our genotyping system. 

In addition to the complete 904 sample dataset, the 
authors provided a smaller, 109 sample representative 
dataset. Of these, we were able to find 94 which were in 
Genbank, and thus were available, with metadata, in our 
database. We ran our genotyping pipeline on these 
sequences, and found 18 clusters, as opposed to the 19 
clades found in the nomenclature study. The phyloge- 
netic tree, colored by cluster membership, is shown in 
Figure 3. To compare the results of our genotyping 
pipeline with the expert-defined genotypes, we used the 
Adjusted Rand Index [48], which has an expected value 
of zero, and a maximum value of 1. The Adjusted Rand 
Index for this experiment was 0.833, indicating a strong 
agreement between our results and the clades defined 
by the WHO/FAO/OIE. 

We also performed a detailed examination of the 
trees produced by the CCV method with those from 
the study. We found that members of Clade 2.3.1 were 
found in two distinct groups on our tree, one of which 
was quite distant from the rest of the samples in the 
tree. Upon looking at the sequences, we found that 
these were much shorter (-1000 bp) than the rest of 
the sequences (-1600 bp), indicating that these 
sequences were most likely HA1 sequences, rather 
than the full HA coding sequence, even though they 
were labelled full HA. This highlights a problem with 
the quality of data that currently exists in the data- 
bases. These inconsistencies in the data can signifi- 
cantly distort the results of the various sequence 
analysis methods. 
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Bacterial Genomes 

We investigated is our implementation can be used for 
larger and more complex genomes. We acquired 27 full 
length Actinomycetes genomes from the Tuberculosis 
Database at the Broad Institute and ran them through 
our pipeline. We were able to produce trees that had 
the same topology as those generated by the Broad in 
about 30 minutes compared to several hours for them 
using traditional tools (Brian Weiner, personal commu- 
nication, from unpublished data). In addition, we com- 
pared the trees produced using the concatenation of the 
predicted proteins of the same set of genomes and we 
got similar results in both time to produce and the 
topology of the trees. However, it should be noted that 
the length of the branches are different. 



Map Reduce 

During the development of the CCV code, we ran into 
memory bottle necks that required extensive coding to 
minimize. In addition, it was noted that several of the 
steps could be parallelized. We examined Hadoop to 
determine if we could utilize it for parallelizing our code 
across commodity hardware in a fault-tolerant way. We 
were able to code most of our algorithms using Hadoop. 
The few that we did not code were the neighbour-join- 
ing tree and the affinity propagation clustering algo- 
rithms. We provide a modified version of Panjo [47], a 
neighbour-joining algorithm, that uses the output of our 
Hadoop cosine distance matrix, which is in row major 
(packed) order. We also can output the matrix in the 
Phylip square format. We were able to generate a 
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Figure 3 Clustering of Influenza Dataset. Phylogenetic tree of the WHO Dataset, colored by cluster membership. The shorter HA1 sequences 
are boxed in red. 



neighbour-joined tree for 10,270 16S samples in 106 
minutes using our Rocks Cluster of 30 machines, which 
we were not able to compute at all using our code on a 
single machine. 

Conclusions 

We have described a fast and accurate method for com- 
putational genotyping, using both human and avian 
Influenza as a model organism, full length Actinomycetes 
genomes, and 13 S samples. This method utilizes 



techniques that are faster than traditional methods for 
both sequence comparison and clustering. Our method 
produces genotypes that closely match those produced 
by expert analysis. In addition to providing discrete gen- 
otypes with minimal human intervention, the complete 
composition vector based method produces trees that 
correlate highly with those produced by sequence align- 
ment and maximum likelihood methods, giving scien- 
tists a visualization of the data that they are familiar 
with in a fraction of the time. Possible uses of these 



Colosimo et al. Source Code for Biology and Medicine 201 1, 6:13 
http://www.scfbm.0rg/content/6/l /1 3 



Page 9 of 10 



A/Migratory Duck/Jiangxi/2295/2005 

A/Bar-headed Goose/QingTiai/75/05 
1 A/Bar-headed Goose/Qinghai/12/05 
A/Bar-headed Goose/Qinghai/60/05 
ri A/Bar-headed Goose/Qinghai/61/05 
■ — "-A/Brown-headed Gull/Qinghai/3/05 
A/Bar-headed Goose/Qinahai/67/05 
1 -A/Environment/Qinghai/31/2005 
A/Bar-headed Goose/Qinghai/59/05 
A/Bar-headed Goose/Qinghai/68/05 
A/Bar-headed Goose/Qinghai/5/05 
" Bar-headed Goose/Qinghai/65/05 
ir-headed Goose/Qingnai/62/05 
^-.k/Jjangx!/21 36/2005 



K/Bar-\ 
|A/Mjc 



Li A/Migratory Duck/JiangxT/2136/2i 
WMigratory Duck/Jiangxi/2300/2u 

y A/chicken/OmskT1 4/05 

— A/grebe/Novosibirsk/29/2005 
A/Novosibirsk/02/05 

-A/duck/Novosibirsk/56/2005 
- A/goose/Cri mea/6 1 5/05 
A/wild duck/Omsk/1 03-01/05 
— A/chicken/Crimea/04/200 
A/chicken/Crimea/08/2005 
"A/Guinea fowl/Shantou/1 341/2006 




05 



(A/el 
j |Wd 



A/whooper swan/Germany/R88/06 
n A/common buzzard/Germany/R306/06 

1 A/duck/Germany/R751/06 

-I A/mute swan/Germany/R65/06 
^A/swan/Germanv/R6572006 
■A/duck/Germany/R338/06 
■ A/coot/Germany7R655/06 

A/Canada goose/Germany/R1 207/06 

L|A/cat/Germany7R606/06 
Wcat/Germany/606/2006/H5N1 
A/perea ri ne/De n ma rk/6632/06 

wey lag, aoose/penmark/6692/06 
. yrjuzzard7Denmark/6370/06 
A/tufted duck/Denmark/6431/06 
A/guinea fowl/Nigeria/957-1 2/2006 
A/chicken/Nigeria/1 047-8/2006 
A/chicken/Nigeria/1 047-30/2006 
A/chicken/Nigeria/957-20/2006 
— A/durk/Niapr/91 4/2006 




^chick d e°n7Kras C noi 
A/chicken/Adygea^O^L _ 
' "lack-headed Gull/Qinghai/2/05 



oose/Pavlodar/1/2005 
' r/01/2006(H5N1) 



o/1 347-1 6/2006 



A/guinea fowl/Burkina Faso/1 347-20/2006 
— A/duck/Cote d'lvoire/1 787-1 8/2006 



. Vchicken/lvory Coast/1 787-35/2006 
. Vchicken/Nigerfe/1 047-62/2006 

- A/ch icken/N igeria/64 1 /2006 

=»p r «r (H5Ni) 

V/chicken/Kurgan/05/2005 

- A/wnooper swan/Mongolia/6/05(H5N1 ) 



H-T-A/ 



A/whooper swan/Mongolia/3/05(H5N1 ) 
— A/whooper swan/Mongolia/4/05(H5N1 




A/buzzard/Bavaria/5/2006 
|— A/swan/Bavari.a/6/2006 



'2006 

,~ . ted d I uck/bavaria/«/2' « * 
A/pochard/Germany/R34L. 
i A/swan/Bavaria/2 1 /2006 
A/common pochard/Switzerland/V505/200i 
H A/common pochard/Switzerland/V592/200i 
'A/mute swan/SwitzerlandA/68/2006 
A/goosander/Bavaria/20/2006 
A/swan/Bavaria/1 7/2006 
A/goosander/Bavaria/1 8/2006 
A/mallard/Bavaria/1/2006 
A/duck/SwitzerlandA/487/2006 
A/buzzard/Bavaria/1 3/2006 
A/duck/Germany/R603/06 
A/duck/Germany/R592/06 
■A/duck/Switzerland/V426/200r 
"A/aoldeneye duck/Bavaria/19 
■—A/tufted duck/Bavaria/9/200l 
A/tufted duck/SwitzerlandA/504/06 



2006 



A/mallard/ltaly/835/2006 



rA/cat/Austria/649/2006 
Vswan/Sloyenia/760/2006 



. Vswan/Austria/2 16/2006 



4« 

-Afmrmrfrm. 



oose/Hungai 
e/Hungai 
'Enalai 



4U|N1) 



-A/teal/Egypt/14051-NAMRU3/2005 

_| ^A/great black-headed gull/Qinghai/1/2005 

~~ I — r A/black-headed gull/Qinghai/1/2005 
L A/black-headed goose/Qinghai/2/2005 



rA/Turkey/65 1242/2006 
H^— A/domestic goose/lraq/81 2/2006 
^-A/human/lraq/207/2006 H 



Figure 4 Integration of Metadata With Genotyping Results. 

Clustering of HA genes from the 2007 United Kingdom H5N1 
outbreak with other samples isolated around the same time. This 
shows how researchers can combine genotyping with the 
metadata. Colors on the map represent the cluster membership. 
The red box indicates the cluster containing the UK Turkey Sample, 
along with the two Hungarian samples. Note the dates for the UK 
and Hungary samples (Light Blue) - these dates were only provided 
at the year level in Genbank, even though more accurate dates can 
be inferred from other sources. 



tools include displaying the genotypes and associated 
metadata on a timeline or map, to show the geospatial 
and temporal distribution of the pathogen population 
(Figure 4). Finally, our MapReduce implementation 



should handle tens of thousands of bacterial size gen- 
omes and genomes of complex Eukaryote organisms (we 
have tested this with several Fusarium sp. and got simi- 
lar trees, data not shown), such as those being produced 
from current and next generation sequencers, providing 
a method to analyze these large datasets. 

Availability 

Project name: Nephele 
Project home page: http://code.google.eom/p/nephele/ 
Operating system: Linux, Mac OS X, Unix 
Programming language: Java and C 
License: Apache License 2.0 

Additional material 



Additional file 1: Representative Standard Nomenclature Dataset of 
H5N1 Genotypes. A set of 94 sequences representing WHO expert- 
defined genotypes (http://www.who.int/csr/disease/avian_influenza/ 
guidelines/nomenclature/). 

Additional file 2: Clustering of Eight Genes from Influenza H3N2 
Viruses (HA, Ml, NA, NP, NS1, PA, PB1, PB2). This dataset consists of 
155 samples, taken from New York State during the 1999-2000, 2001- 
2002, 2002-2003, and 2003-2004 flu seasons. 
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